Abstract

Shared-memory  multiprocessors  have  received  wide  attention 
in  recent  times  as  a  means  of  achieving  high-performance 
cost-effectively.  Their  viability  requires  a  thorough  under¬ 
standing  of  the  memory  access  patterns  of  parallel  process¬ 
ing  applications  and  operating  systems.  This  paper  reports 
on  the  memory  reference  behavior  of  several  parallel  applica¬ 
tions  running  under  the  MACH  operating  system  on  a  shared- 
memory  multiprocessor.  The  data  used  for  this  study  is  de¬ 
rived  from  multiprocessor  address  traces  obtained  from  an 
extended  ATUM  address  tracing  scheme  implemented  on  a 
4-CPU DEC VAX 8350. The applications include parallel
OPS5, logic simulation, and a VLSI wire routing program.
Among the important issues addressed in this paper are the
amount of sharing in user programs and in the operating
system, a comparison of the characteristics of user and system
reference patterns, sharing related to process migration, and
the temporal, spatial, and processor locality of shared blocks.
We also analyze the impact of shared references on cache
coherence in shared-memory multiprocessors.


1  Introduction 

Although  we  now  have  a  reasonably  good  understanding  of 
memory  system  design  for  uniprocessors,  very  little  is  under¬ 
stood  about  memory  system  design  for  multiprocessors.  A 
major  reason  for  this  has  been  the  lack  of  real  data  about 
memory  reference  patterns  for  multiprocessors,  because  of 
the  difficulty  of  tracing  such  machines.  The  problem  of  get¬ 
ting  realistic  trace  data  is  even  more  acute  if  one  wishes  to 
study the effects of operating system references, process
migration,  and  other  such  real  system  events.  This  paper 
attempts  to  correct  this  situation  and  analyzes  memory  ref¬ 
erence  patterns  of  several  parallel  applications  running  under 
the  MACH  operating  system  on  a  shared-memory  multipro¬ 
cessor.  The  address  traces  used  in  our  study  were  obtained 
from a 4-processor VAX 8350 multiprocessor using an extended
version of the ATUM [1] address tracing technique.
These  traces  contain  both  system  and  user  memory  refer¬ 
ences,  including  process  migration  information. 

Analysis of shared-memory reference patterns is needed to
determine  the  most  suitable  organization  of  the  memory  hi¬ 
erarchy  in  multiprocessors.  For  example,  several  cache  con¬ 
sistency  algorithms  proposed  in  the  literature  are  based  on 
subtle  differences  in  the  expected  memory  reference  patterns; 
lacking  detailed  data,  the  benefits  of  one  scheme  over  another 
cannot  be  assessed  accurately.  While  some  previous  studies 
have looked at shared-memory reference patterns, e.g., [2],
they  did  not  fully  address  issues  such  as  the  temporal,  spatial, 
and  processor  locality  of  shared  data,  sharing  in  the  operating 
system,  and  the  impact  on  cache  consistency.  For  example, 
we  show  that  shared  references  display  a  significant  amount 
of  processor  locality.  The  average  number  of  read  and  write 
references  to  a  write-shared  block  before  a  remote  reference 
are  4  and  2  respectively.  This  locality  is  exploited  by  the 
write-back  class  of  cache  coherence  schemes  to  significantly 
reduce  the  cost  of  references  to  shared  data.  Another  surpris¬ 
ing  result  that  we  observed  for  shared  data  references  is  that 
the total bus bandwidth required is minimized when block
size  is  4  bytes  and  increases  as  the  block  size  is  increased. 
We  also  observe  that  processor  migration  causes  a  large  in¬ 
crease  in  the  sharing  level  as  observed  by  the  caches,  which 
can  greatly  increase  cache  coherence  traffic  on  the  bus. 

This paper is organized as follows. Section 2 presents background
information about the ATUM address tracing technique,
the applications measured, and the MACH operating
system.  Section  3  defines  our  multiprocessor  model  and  the 
terminology  used  throughout  the  paper.  Section  4  constitutes 
the  bulk  of  the  paper  and  is  devoted  to  analyzing  the  traces. 
This  section  characterizes  shared-memory  reference  patterns 
and  looks  at  the  impact  of  the  reference  characteristics  on 
cache  consistency  algorithms.  Specifically,  in  Section  4.1  we 
present  data  about  the  general  characteristics  of  the  traces, 
including  statistics  about  interlocked  instructions.  Section 
4.2  assesses  the  temporal  and  processor  locality  of  shared  ref¬ 
erences.  Section  4.3  focuses  on  how  the  memory  reference 
characteristics  affect  the  performance  of  various  cache  consis¬ 
tency  algorithms.  Section  5  concludes  the  paper. 


2  Background  and  Methodology 

Our  study  is  based  on  trace  analysis.  The  traces  are  obtained 
using  a  multiprocessor  extension  of  the  ATUM  tracing  scheme 
[1]. ATUM stands for Address Tracing Using Microcode and
works  as  follows:  During  the  execution  of  each  instruction, 
the  microcode  writes  out  the  memory  references 
made by the processor to a portion of memory reserved for
tracing. In the multiprocessor extension of ATUM, each access
to trace memory is interlocked to enable the microcode
in several processors to write their references to this memory.
Thus a trace contains interleaved address streams of several
processors. The traces used for this study were gathered on
a 4-CPU VAX 8350 machine running the MACH operating
system. ATUM traces are "complete" in that they capture
all operating system and multiprogramming activity. Each
trace is roughly 3.5 million references long. In addition to
addresses, ATUM records the opcodes and the virtual-to-physical
translations that occur during translation-lookaside-buffer
(TLB or TB) misses. A location is considered shared
when  it  is  referenced  by  more  than  one  CPU.  Because  differ¬ 
ent  processes  could  access  a  given  shared  location  with  differ¬ 
ent  virtual  addresses,  sharing  is  detected  by  translating  the 
various  virtual  addresses  of  a  shared  location  to  its  common 
physical  address. 
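
The sharing test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout (cpu, pid, virtual address), the `page_table` dictionary, and the 512-byte page size used for the translation are placeholders chosen here, not a description of the actual ATUM trace format.

```python
from collections import defaultdict

BLOCK_SIZE = 4    # bytes; the block size used throughout the paper
PAGE_SIZE = 512   # bytes; assumed page size for this illustration


def physical_block(vaddr, pid, page_table):
    """Translate (pid, virtual address) to a physical block number.

    page_table is a hypothetical dict mapping (pid, virtual page) to a
    physical page number; it stands in for the recorded translations.
    """
    vpage, offset = divmod(vaddr, PAGE_SIZE)
    paddr = page_table[(pid, vpage)] * PAGE_SIZE + offset
    return paddr // BLOCK_SIZE


def shared_blocks(trace, page_table):
    """Return the physical blocks referenced by more than one CPU.

    trace is an iterable of (cpu, pid, vaddr) tuples (assumed layout).
    """
    cpus_per_block = defaultdict(set)
    for cpu, pid, vaddr in trace:
        block = physical_block(vaddr, pid, page_table)
        cpus_per_block[block].add(cpu)
    return {blk for blk, cpus in cpus_per_block.items() if len(cpus) > 1}
```

Because the comparison is made on physical block numbers, two processes that map the same location at different virtual addresses are still recognized as referencing the same block.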

The traces used in this paper are obtained from three programs:
POPS, THOR, and PERO. POPS [3] is a parallel
implementation of a rule-based programming language called
OPS5, which is a widely used language for building expert
systems. It exploits parallelism at a fine granularity and
makes  extensive  use  of  the  shared  memory  provided  by  the 
architecture.  THOR  is  a  parallel  implementation  of  a  logic 
simulator  done  by  Larry  Soule  at  Stanford  University.  The 
simulator  transforms  the  task  of  circuit  simulation  into  a  se¬ 
ries  of  node  evaluations,  where  each  node  corresponds  to  a 
device  in  the  circuit.  The  parallel  implementation  evaluates 
these  nodes  in  parallel,  while  taking  care  of  the  dependencies 
between  them.  PERO  is  a  parallel  VLSI  router  written  by 
Jonathan  Rose  at  Stanford  [4]. 

We  briefly  describe  the  MACH  operating  system,  since 
some of the shared references in the traces belong to it, and
also  because  the  programming  style  used  in  the  applications 
was  influenced  by  it.  MACH  is  a  multiprocessor  operating 
system  developed  at  Carnegie  Mellon  University.  It  is  binary 
compatible  with  Berkeley  Unix,  and  provides  several  new  fa¬ 
cilities  to  support  parallel  processing.  It  provides  facilities  for 
multiple tasks to share memory, permitting the exploitation
of very fine-grained parallelism. All three programs make use
of  multiple  tasks  that  share  memory  to  communicate  with 
each  other  and  to  share  information.  MACH  is  not  a  to¬ 
tally  symmetric  operating  system  in  that  kernel  interrupts 
are  handled  by  processor  zero.  This  causes  the  memory  ref¬ 
erence  pattern  of  processor  zero  to  be  different  from  that  of 
the  remaining  processors.  In  parallel  programs,  where  many 
tasks are performing I/O, the high level of OS interrupts can
also  cause  excessive  process  migration.  Fortunately,  none  of 
the  programs  that  we  study  in  this  paper  do  very  much  I/O. 

Table  1  presents  general  trace  characteristics  for  the  three 
programs. The columns denote the total number of references,
instruction references, data reads, data writes, user
references, and system references. Instruction and data
references are about equal, while there are roughly three reads
to every write. About 12% of all references are system references.

The  ATUM  traces  used  for  this  study  do  have  some  lim¬ 
itations.  The  machine  used  had  only  4  CPUs  and  it  is  not 
clear how to extend the results to a larger number of processors.
Work on this issue is in progress. Another problem is
the  unavailability  of  a  large  number  of  applications,  but  the 
number  is  growing. 


Table 1: Summary of trace characteristics. All numbers are
in thousands.

Program   Total   Instr   Data reads   Data writes   User   System
POPS       3142    1624         1257           261   2817
THOR       3222    1456         1398           368   2727      495
PERO       3508    1834         1266           409   3242      266

3  Multiprocessor  Model  and 
Definitions 

The  multiprocessor  model  we  assume  for  our  analyses  in  this 
paper  is  quite  straightforward.  We  assume  that  the  system 
consists  of  several  processors  each  with  its  own  cache  memory. 
The  caches  are  connected  to  a  common  system  bus  on  which 
shared  main  memory  is  located.  We  also  make  the  simplifying 
assumption  that  caches  are  infinite  in  size,  since  we  would  like 
to  concentrate  on  traffic  caused  due  to  shared  data  and  not 
mix  it  up  with  traffic  due  to  limited  cache  size. 

We  introduce  some  nomenclature  to  help  explain  memory 
access  patterns.  A  block  is  the  unit  of  data  transfer  between 
the cache and main memory. For the rest of the paper, we
assume  block  size  to  be  1  word  (4  bytes).  The  small  block  size 
is  chosen  so  that  the  reference  behavior  for  each  data  object 
can  be  derived.  However,  characterization  using  larger  block 
sizes  is  also  important  to  study  the  spatial  locality  of  shared 
objects,  and  is  dealt  with  in  Section  4.3. 

A read-shared block is one that is shared (accessed by multiple
processors), but never written into. A write-shared block
is one that is shared, and both read and written into. A reference
to a block B by processor i is said to ping if the previous
reference to that block was by processor j, where j ≠ i. We
call such a reference a pinging reference. Conversely, a reference
to a block B by processor i is said to cling if the previous
reference to that block was also by processor i. Such a reference
is called a clinging reference. By these definitions, a
ping  can  only  occur  on  a  reference  to  a  shared  block.  Pings 
and  clings  to  a  block  are  determined  simply  by  keeping  track 
of  which  processor  last  referenced  a  block.  References  are 
read  references  or  write  references  depending  on  whether  the 
operation  performed  is  a  read  or  write.  The  state  of  a  block 
(clean/dirty)  is  determined  by  the  references  of  the  proces¬ 
sor  accessing  it  currently.  A  block  is  said  to  be  dirty  if  it 
has  been  written  into  after  the  previous  pinging  reference  to 
it.  Therefore,  a  block  always  starts  out  clean  after  a  pinging 
reference  to  it. 
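
The bookkeeping behind these definitions can be sketched as follows. The tuple layout (cpu, block, is_write) is assumed for illustration. Two choices in the sketch go beyond what the text states and are assumptions: the very first reference to a block is labelled "first" (it is neither a ping nor a cling), and a pinging write leaves the block dirty for the new processor.

```python
def classify(trace):
    """Yield (kind, was_dirty) for each reference in the trace.

    trace items are (cpu, block, is_write) tuples (assumed layout).
    kind is "first", "cling", or "ping"; was_dirty is the block state
    seen by the reference before the reference itself takes effect.
    """
    last_cpu = {}  # block -> processor that last referenced it
    dirty = {}     # block -> written since the previous pinging reference?
    for cpu, block, is_write in trace:
        prev = last_cpu.get(block)
        if prev is None:
            kind = "first"   # assumption: not covered by the definitions
        elif prev == cpu:
            kind = "cling"
        else:
            kind = "ping"
        was_dirty = dirty.get(block, False)
        if kind == "ping":
            dirty[block] = False  # a block starts out clean after a ping
        if is_write:
            dirty[block] = True   # assumption: a pinging write dirties it
        last_cpu[block] = cpu
        yield kind, was_dirty
```

Only the identity of the last referencing processor and one dirty bit per block are needed, which is why the classification is independent of cache size.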

The  notion  of  clings  and  pings  yields  useful  insights  on  how 
various  shared-memory  multiprocessor  architectures  would 
perform.  The  appealing  feature  of  clings  and  pings  is  that 
they  do  not  depend  on  implementation  details  such  as  cache 
sizes.  Assuming  a  local  cache,  clinging  read  references  never 
need the bus; pinging read references need to use the bus only
if  the  read  misses  or  if  the  block  is  dirty  in  another  cache. 
A  bus  transaction  must  occur  on  a  pinging  write  reference. 
In  the  ensuing  discussion  we  will  show  results  on  the  time 
intervals  between  such  clings  and  pings,  and  also  on  the  fre¬ 
quency  of  various  kinds  of  clings  and  pings.  The  time  interval 
plots  are  a  useful  method  of  depicting  the  temporal  locality 
of  shared-memory  references,  while  the  frequency  of  clings 
and pings is a method of showing the "processor locality" of
references  to  a  block.  Besides  spatial  locality  and  temporal 
locality,  the  form  of  locality  important  in  a  multiprocessor 
context is processor locality - the tendency of a processor
to  access  a  block  repeatedly  before  an  access  from  another 
processor.  A  direct  impact  of  this  locality  is  noticed  in  the 
performance  of  various  cache  consistency  schemes,  which  ex¬ 
ploit  different  locality  patterns  in  references  to  read-shared  or 
write-shared  blocks.  Also  notice  that  a  high  temporal  local¬ 
ity  of  pinging  references  yields  a  low  processor  locality,  and 
negatively  impacts  the  performance  of  multiprocessor  caches. 
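
The bus-usage rules stated above for a local cache can be written as a small decision function. This is a sketch only: the function name and parameters are hypothetical, and because the text gives no rule for clinging writes, the sketch returns `None` for that case rather than guessing a protocol.

```python
def needs_bus(kind, is_write, misses=False, dirty_elsewhere=False):
    """Return True/False per the rules in the text, or None where it is silent.

    kind is "cling" or "ping"; misses and dirty_elsewhere describe the
    cache state seen by a pinging read (hypothetical parameters).
    """
    if kind == "cling":
        # Clinging reads never need the bus; the text states no rule for
        # clinging writes, so None marks that case as protocol dependent.
        return None if is_write else False
    if kind == "ping":
        if is_write:
            return True  # a bus transaction must occur on a pinging write
        return misses or dirty_elsewhere
    raise ValueError(f"unknown reference kind: {kind!r}")
```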

To  separate  the  effects  of  process  migration,  we  also 
present  numbers  for  process-migration-shared  blocks.  These 
are blocks accessed from processor i by process p, and also
from processor j ≠ i by the same process p. On the other
hand, real-shared blocks are blocks accessed from processor
i by process p, and also from processor j ≠ i by process q,
where q ≠ p always holds.
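
A minimal sketch of how the two kinds of sharing could be told apart, assuming trace records of the form (cpu, pid, block); this layout is hypothetical. Note that under these definitions a single block can exhibit both kinds of sharing, so the sketch returns a set of labels per block.

```python
from collections import defaultdict
from itertools import combinations


def sharing_kinds(trace):
    """Map each block to the set of sharing kinds it exhibits.

    trace items are (cpu, pid, block) tuples (assumed layout).
    """
    accessors = defaultdict(set)  # block -> {(cpu, pid), ...}
    for cpu, pid, block in trace:
        accessors[block].add((cpu, pid))

    kinds = {}
    for block, pairs in accessors.items():
        found = set()
        for (cpu_a, pid_a), (cpu_b, pid_b) in combinations(pairs, 2):
            if cpu_a == cpu_b:
                continue  # same processor: not sharing in the paper's sense
            if pid_a == pid_b:
                found.add("process-migration-shared")
            else:
                found.add("real-shared")
        kinds[block] = found
    return kinds
```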

It  is  useful  to  have  a  notion  of  time  in  the  context  of  multi¬ 
processor  execution.  Our  traces  contain  interleaved  memory 
accesses  by  the  various  processors  in  approximately  the  same 
order  they  occurred.  However,  the  exact  time  at  which  the 
reference was made is not clear. For example, if the processors
i, j, and k each made references at real time instants
t, t+1, and so on, the trace might have references
i_t, j_t, k_t, k_{t+1}, j_{t+1}, i_{t+1}, and so on, where the order of the t-th
references of the 4 processors might be random with respect
to  each  other.  The  traces  also  show  clusters  of  memory  ref¬ 
erences  by  the  same  processor,  and  the  time  interval  between 
references  by  the  same  processor  also  varies. 

Due  to  this  nature  of  the  reference  pattern,  we  will  not  try 
to  approximate  real  time.  Instead,  we  will  use  the  order  of 
occurrence  of  a  reference  in  the  trace  as  the  index  of  time.  So 
the r-th reference in the trace is considered to have occurred at
time r (see footnote 1). The paper considers several cases where the traces
are filtered to extract specific references (e.g., user), and to
enable comparisons, the time index used for a reference depends
on its index in the original trace. For example, when we
filter  out  operating  system  references  while  studying  sharing 
in  the  user  address  space,  the  time  index  of  a  user  reference 
corresponds  to  its  position  in  the  unaltered  trace. 
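
The convention of keeping the original trace position as the time index after filtering can be illustrated as follows. The record format (a dict with a boolean `system` field) is an assumption made for this sketch.

```python
def user_references(trace):
    """Yield (time, ref) for user references only.

    The time index is the 1-based position in the unfiltered trace, so
    intervals stay comparable across differently filtered views.
    trace items are dicts with a boolean "system" field (assumed layout).
    """
    for time, ref in enumerate(trace, start=1):
        if not ref["system"]:
            yield time, ref
```

For a trace whose second reference is a system reference, the two surrounding user references keep times 1 and 3 rather than being renumbered 1 and 2.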


4  Results  and  Analyses 

We first present some general statistics about the traces, including
data about interlocked instructions. We then present
statistics  about  temporal  and  processor  locality  found  in  the 
traces  when  only  user  references  are  included  and  there  is  no 
process  migration  sharing,  when  both  system  and  user  refer¬ 
ences  are  included,  and  when  the  effects  of  process  migration 
are  taken  into  account.  We  then  evaluate  three  different  cache 
coherence  schemes  on  the  basis  of  the  amount  of  traffic  they 
generate  on  a  shared  bus.  Unless  stated  otherwise,  we  assume 
infinite caches and a 4-byte block size.


4.1  General  Statistics 

The statistics in Table 2, for both instruction and data references
of user and the operating system, relate to the number
of unique blocks and the proportion of references to shared
blocks in the traces.

Footnote 1: We believe that fine time distinctions are not significant in our
study. To approximate real time, one can keep a virtual system
time incremented by one unit for every n references in the trace,
where n is the number of processors. In other words, the times
specified in our paper can be divided by 4 to get a rough idea of
the real time.


Table 2: Proportion of shared references and unique shared
blocks when the block size is 4 bytes. Both instruction and
data references of user and OS are included. All numbers are
in thousands.

Table 3 gives the same statistics, but only for data references
of both user and the operating system. In addition,
Table 3 presents the number of blocks that are written. Because
the instruction space is usually read-only, it can be
treated  specially  in  memory  management,  and  so  most  of  the 
statistics  presented  later  correspond  to  data  references  alone. 
Table  4  presents  the  same  statistics  for  user  data  references 
alone. 

When  both  user  and  the  operating  system  data  references 
are  considered,  the  ratio  of  shared  references  to  all  data  refer¬ 
ences  (averaged  over  all  three  traces)  is  0.25;  the  ratio  is  0.27 
when  only  user  data  references  are  considered.  We  see  that 
the  level  of  sharing  in  the  operating  system  is  only  slightly 
lower than in user code.

These  traces  have  an  insignificant  amount  of  process- 
migration-related  sharing.  We  also  looked  at  some  other 
traces  for  the  same  applications  with  a  large  amount  of  pro¬ 
cess  migration,  and  the  levels  of  sharing  are  drastically  differ¬ 
ent  in  these  traces.  The  ratio  of  shared  to  total  is  0.9  for  user 
data references when process migration is high; when process
migration  effects  are  excluded  (only  references  to  real-shared 
blocks  are  counted),  the  ratio  of  user  data  references  and  all 
data  references  falls  to  0.2. 

4.1.1  Statistics  for  Interlocked  Instructions 

The  VAX  architecture  provides  seven  interlocked  instructions 
for synchronization. These are: BBSSI - branch on bit set
and set interlocked; BBCCI - branch on bit clear and clear
interlocked; ADAWI - add aligned word interlocked; INSQHI,
INSQTI, REMQHI, REMQTI - four instructions to manipulate
linked lists (queues) in an interlocked manner. The usage
of  these  instructions  is  presented  in  Table  5,  with  separate 
numbers  given  for  operating  system  code  and  user  code. 

Table  5  shows  that  only  BBSSI  and  BBCCI  instructions 
occur  in  the  trace.  The  ADAWI  instruction  is  used  in  the 
POPS  code,  although  it  does  not  occur  in  the  instruction 
references  that  our  trace  contains.  These  statistics  show  the 
strong preference of programmers to use the simpler test&set
type  instructions  for  synchronization,  rather  than  using  the 
more  complex  queue  manipulation  instructions. 

The number of interlocked instructions as a fraction of all
instruction references is 0.1%-1.6% for the three programs.
While the fraction is as high as 1.2%-1.6% for POPS and
THOR, the fraction is only 0.1% for PERO. The reason is
simply that the author of PERO had made an explicit decision
not to use locks for the most frequently used data structure,
thus trading the quality of the final solution for extra
performance. Since executing an interlocked instruction may
be as much as 10-20 times more expensive than an ordinary
instruction on some multiprocessors, a small percentage of
interlocked instructions can consume a large percentage of
total execution time. We also note that most of the interlocked
instructions result from the user code and not from
the operating system code.

Table 3: Proportion of shared references and unique shared data blocks when the block size is 4 bytes. Only data references of
both user and OS are included. All numbers are in thousands.

Table 4: Proportion of shared references and unique shared data blocks when the block size is one word (4 bytes). Only data
references of user are included. All numbers are in thousands.

4.2  Temporal  and  Processor  Locality  of 
User  Data  References 

This  section  deals  with  dynamic  memory  access  patterns  and 
characterizes  the  temporal  and  processor  locality  of  real- 
shared  user  data  references.  The  first  few  figures  plot  the 
cumulative  distributions  and  the  frequency  distributions  of 
the  time  intervals  between  clinging  and  pinging  references  to 
demonstrate  the  temporal  locality  of  data  references.  All  fig¬ 
ures  use  a  block  size  of  4  bytes. 

Figure  1(a)  shows  the  cumulative  frequency  distribution 
of  the  time  interval  between  clinging  references  to  a  shared 
block. In other words, a point (x, y) on a curve means that
y references occur to a block with the time interval between
these references not more than x. The corresponding frequency
distribution plot for one of these programs is also
shown in Figure 1(b). Due to the wide range of time intervals
in which the references occur, the bins on the X-axis
increase in powers of two. Therefore a bar at x with height
y in the frequency plot implies that y references occur to a
block with an interval t such that x < t < 2x. For brevity we
plot  the  frequency  distributions  only  for  THOR. 
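
The interval measurement and the power-of-two binning can be sketched as below. The trace layout (cpu, block) is assumed, time is the 1-based trace index, and the exact treatment of bin edges is an assumption of this sketch: an interval t is placed in the bin labelled x when x <= t < 2x.

```python
from collections import Counter


def cling_intervals(trace):
    """Return the time intervals between clinging references.

    trace items are (cpu, block) tuples (assumed layout); an interval
    is recorded when a block is re-referenced by the same processor
    that referenced it last.
    """
    last = {}  # block -> (cpu, time) of the previous reference
    intervals = []
    for time, (cpu, block) in enumerate(trace, start=1):
        if block in last and last[block][0] == cpu:
            intervals.append(time - last[block][1])
        last[block] = (cpu, time)
    return intervals


def power_of_two_histogram(intervals):
    """Count intervals per bin; bin x holds t with x <= t < 2x (assumed edges)."""
    hist = Counter()
    for t in intervals:
        hist[1 << (t.bit_length() - 1)] += 1  # largest power of two <= t
    return dict(sorted(hist.items()))


def cumulative(hist):
    """Turn a sorted bin -> count mapping into running totals."""
    total, out = 0, {}
    for x, count in hist.items():
        total += count
        out[x] = total
    return out
```

Replacing the same-processor test with a different-processor test gives the corresponding intervals for pinging references.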

The  average  interval  of  time  between  accesses  to  the  same 
shared  block  is  1165  time  units  in  THOR.  This  number  is 
unusually  large  because  even  one  reference  with  a  very  large 
interval  (or  an  outlier)  can  skew  the  average  towards  large 
values.  Therefore,  in  the  context  of  time  intervals,  a  more 
interesting  number  is  the  median,  or  the  time  interval  over 
which  half  the  clinging  references  occur.  It  is  easy  to  see  that 
over  50%  of  the  intervals  are  25  time  units  or  less  in  THOR. 
(The  much  larger  average  is  due  to  the  bias  brought  in  by  a 
few  outliers.)  Not  surprisingly,  these  numbers  indicate  that 
blocks  are  re-referenced  at  small  intervals  of  time,  which  is 
simply  a  reconfirmation  of  the  fact  that  memory  references 
display  a  high  temporal  locality,  and  is  the  precise  reason  why 
caching  is  successful.  The  values  at  4K-8K  time  units  form 
a second peak (Figure 1(b)), although the height is much
smaller  than  the  first  peak  at  16-32  time  units.  This  second 
peak can be explained as clinging references that occur when
the process resumes execution on the same processor after being
switched out. The first peak, clearly, is due to references
within  a  context  switch  interval.  The  height  of  the  second 
peak  is  much  larger  in  traces  that  show  significant  process 
migration.  This  low  temporal  locality  component  of  clinging 
references  introduced  by  process  migration  can  be  deleterious 
to  cache  performance. 
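
The effect of outliers on the average, and the reason the median is preferred for time intervals, can be seen in a small numerical example. The interval values below are invented for illustration and are not taken from the traces.

```python
from statistics import mean, median

# Invented interval data: four short re-reference intervals and one outlier.
intervals = [10, 20, 25, 30, 9000]

average = mean(intervals)   # pulled far upward by the single outlier
middle = median(intervals)  # unaffected by the magnitude of the outlier
print(average, middle)
```

Here a single long interval, such as one spanning a context switch, raises the average well above every other value in the sample, while the median stays at a typical short interval.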

These results are compared with those for pinging references,
that is, for a reference to a block by a processor followed
by  a  reference  from  another  processor.  Figure  2(a)  shows 
the  cumulative  distribution,  and  Figure  2(b)  the  frequency 
distribution.  The  time  intervals  in  this  case  are  interest¬ 
ingly  lower  than  for  clinging  references,  which  says  that  ref¬ 
erences  to  shared  blocks  by  different  processors  are  usually  at 
least  as  finely  interleaved  as  references  by  the  same  processor. 
Doubtlessly,  the  fact  that  our  applications  exploit  parallelism 
at  a  fine  granularity  is  the  cause  of  the  high  temporal  locality. 

The  small  second  peak  at  256  time  units  in  Figure  2(b)  is 
due  to  the  process  migrating  to  another  processor  following  a 
context  switch.  If  the  level  of  process  migration  is  high,  this 
peak  at  a  large  time  interval  can  become  much  taller,  which 
falsely  suggests  that  process  migration  lowers  the  temporal 
locality  of  shared  references.  In  reality,  process  migration 
simply  makes  a  large  fraction  of  the  logically  private  blocks 
appear  shared,  and  it  is  references  to  these  shared  blocks 
alone  that  give  rise  to  the  tall  second  peak. 

Our  analysis  also  shows  that  roughly  a  fourth  of  the  data 
references  are  to  shared  data.  However,  a  large  part  of 
the  shared  references  need  not  generate  bus  traffic  because 
in  most  multiprocessor  architectures,  the  large  number  of 
clinging  references  to  shared  blocks  (especially  reads)  can  be 
treated  in  much  the  same  manner  as  references  to  private 
blocks; in other words, blocks can be treated as private during
large windows of time.

The  previous  figures  did  not  distinguish  between  read  and 
write  references.  Making  this  distinction  is  necessary  because 
in  many  high-performance  multiprocessor  architectures,  only 
pinging  references  to  dirty  blocks  cause  bus  traffic  when  the 
new  value  of  the  dirty  block  must  be  somehow  transmitted  to 
the requesting processor. Figure 3 shows the distribution of
the  time  interval  between  pinging  references  to  a  dirty  block. 
The  total  number  of  pinging  references  to  dirty  blocks  is  far 
less  than  all  the  pinging  references.  As  we  shall  show  later 
in  our  discussion  on  cache  consistency  performance,  sophisti¬ 
cated  cache  management  schemes  that  take  advantage  of  such 
features  can  have  significant  advantages  over  simpler  schemes. 

Table 5: Interlocked instruction statistics. Note the numbers are not in thousands.

Figure 2: Distribution of the time interval between pinging references to a block. Only real-shared data references of

Comparing Figures 2(b) and 3(b), we see that the peak
around  the  time  interval  4-8  in  Figure  2(b)  is  caused  by  ref¬ 
erence  to  read-shared  objects.  Because  Figure  3(b)  does  not 
show  this  early  peak,  we  believe  that  references  to  write- 
shared  blocks  have  less  temporal  locality  than  references 
to  read-shared  blocks,  which  benefits  multiprocessor  caches. 
A possible case is the test-and-test&set synchronization sequence,
where one might expect multiple reads from several
processors,  but  less  frequent  writes.  The  low  temporal  lo¬ 
cality  in  pinging  references  to  dirty  blocks  encourages  us  to 
believe  that  for  large  time  periods  blocks  can  be  considered 
as  private  and  no  traffic  need  be  generated  in  maintaining 
consistent  caches. 

As  caches  grow  bigger,  blocks  are  expected  to  stay  in  the 
cache  for  long  periods  of  time.  In  such  a  situation,  a  bet¬ 
ter  characterization  uses  the  notion  of  processor  locality.  (A 
similar characterization has also been used in [5].) We will
address  processor  locality  in  two  ways.  The  first  looks  at  the 
number of references to a block before a pinging reference
to it, and the second looks at the number of references to a
block before a pinging reference to it, given that at least one
of  the  references  was  a  write.  Each  of  the  above  two  char¬ 
acterizations  is  pertinent  to  some  cache  consistency  scheme. 
For  example,  the  first  one  indicates  the  potential  of  a  cache 
consistency  scheme  that  allows  only  one  cached  copy  of  a 
block. 

Figure  4(a)  shows  the  cumulative  distribution  of  the  num¬ 
ber  of  references  to  a  block  before  a  pinging  reference,  and 
Figure  4(b)  the  frequency  distribution.  In  Figure  4(b)  for 
THOR, there are about 200,000 pinging references to a block
referenced  only  once  by  the  previous  processor.  Unlike  in 
the  distributions  of  time  intervals,  where  we  used  the  median 
as  a  measure  of  temporal  locality,  here  the  average  is  more 
indicative  of  processor  locality,  because  outliers  represent  a 
large  number  of  references,  and  must  be  weighted  accordingly. 
The  low  average  of  1.3  indicates  that  interleaved  references 
by different processors are as frequent as clinging references,
implying  low  processor  locality.  We  evaluated  a  cache  con¬ 
sistency  scheme  that  allowed  only  one  cached  copy  of  any 
block  [6],  and  it  performed  abysmally  for  this  very  reason. 
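
Both characterizations of processor locality (all runs, and runs containing at least one write) can be computed with one pass over a trace. The sketch below assumes records of the form (cpu, block, is_write); the function names are hypothetical. A run that is never ended by a ping is not counted.

```python
def runs_before_ping(trace):
    """Return one (reads, writes) pair per pinging reference.

    Each pair counts the references made to the block by the previous
    processor since it took the block over.
    trace items are (cpu, block, is_write) tuples (assumed layout).
    """
    state = {}  # block -> [cpu, reads, writes] for the current run
    runs = []
    for cpu, block, is_write in trace:
        cur = state.get(block)
        if cur is None or cur[0] != cpu:
            if cur is not None:
                runs.append((cur[1], cur[2]))  # a ping ends the old run
            cur = state[block] = [cpu, 0, 0]
        cur[2 if is_write else 1] += 1
    return runs


def average_run_length(runs, require_write=False):
    """Average references per run, optionally only runs with a write."""
    chosen = [r + w for r, w in runs if w > 0 or not require_write]
    return sum(chosen) / len(chosen) if chosen else 0.0
```

Calling `average_run_length(runs)` corresponds to the first characterization, and `average_run_length(runs, require_write=True)` to the second.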


One  of  the  chief  differences  between  some  of  the  snoop¬ 
ing  cache  consistency  schemes  is  the  way  they  treat  write 
references. One set of schemes, e.g., DRAGON [7] or FIREFLY [8],
allow caches to hold valid copies of blocks that are
being  written  into  by  others,  and  update  the  values  on  writes. 
Another  set  of  schemes  prefer  to  allow  only  one  copy  of  a  writ¬ 
ten  block  (e.g.,  Berkeley  Ownership  [9],  or  various  flavors  of 
directory  schemes  [6]).  The  performance  of  one  or  the  other 
method  is  predicated  on  the  locality  of  references  to  write- 
shared  blocks,  which  we  address  next. 

Figure  5  shows  the  number  of  read  and  write  references  - 
at  least  one  reference  a  write  -  before  a  pinging  reference. 
Several  observations  can  be  made  from  this  figure.  First,  the 
average  number  of  references  to  write-shared  blocks  by  the 
same processor before a pinging reference is 5.6 for POPS, 3.6
for THOR, and 7.5 for PERO. Write references are relatively
fewer than reads and contribute 1.6, 1.7, and 1.2 respectively
to these averages. These averages indicate that the processor
locality of shared-writable blocks is higher than that of
read-shared blocks. (Recall that the corresponding numbers for
all references were 1.8, 1.3, and 2.5.) The higher processor
locality  indicates  that  a  shared  written  datum  is  accessed 
multiple  times  by  a  processor  before  being  relinquished. 

A more important observation from Figure 5 is that the
total number of these pings is approximately an order of
magnitude lower than the number of all pinging references, which
lessens the adverse impact of the low processor locality of write
references on the performance of cache consistency schemes.

As noted earlier, the average number of writes to a block
before a pinging reference is small (1.7 for THOR); there are
several possible reasons for this low value. We expect a low
value for references caused by spinlocks. We also expect this
value to be low for shared objects which move from one
processor to another, with each processor making some
modifications to the object. Also, mostly-read-only objects are
written once, and then numerous pinging read references are made
by other processors.

Figure 4: Distribution of the number of references to a block
user included.

Thus far, we saw that the processor locality of shared
references is moderate, with roughly 2 writes and 4 reads
on average to write-shared objects before a pinging reference.
Therefore, a good cache consistency scheme must ensure
effective handling of repeat read references to shared blocks.
Given the moderate processor locality of shared data, we
cannot directly determine whether invalidating cache consistency
schemes, such as the Berkeley Ownership protocol or directory
schemes, or updating protocols, such as the Dragon and
Firefly schemes, are superior. A more detailed evaluation that
takes into account the cost of updating versus invalidating
must be undertaken to make a decision.

4.2.1  Sharing  Characteristics  of  Both  User  and 
OS  References 

The following discussion focuses on the sharing
characteristics of both user and system references, where instruction
references are excluded, as before. The general observation is
that the sharing characteristics of user and system are not
significantly different, although the temporal locality of shared
system references was slightly lower, and the processor
locality was slightly higher.

For the times between clinging references in POPS, THOR,
and PERO, the medians occurred at 26, 2', and 2T772 for user
and system, while the corresponding numbers for user alone
were 23, 25, and 28186. The times between pinging references
were different by roughly the same ratio, while the times
between pinging references to dirty blocks showed greater
variation. The medians for user and system were 438, 2095, and
12446, as compared to 363, 1779, and 19711 for user alone.

The processor locality metrics also showed only small
differences from the case of user references alone. In general,
for the user and system references the average number of
references to a block before a pinging reference was roughly
5% greater. A similar trend was observed for the number of
references to write-shared blocks.


4.2.2  Effects  of  Process  Migration 

Since the three traces we have discussed so far do not show a
significant amount of process migration, we used three other
traces of the same applications that did. Due to space
constraints we only summarize our findings here; details
are presented in [10].

The  temporal  locality  of  clinging  references  decreases  if 
processes  are  rescheduled  on  the  same  processor,  after  having 
run  on  another  processor  (it  will  show  up  as  a  large  increase  in 
the  height  of  the  second  peak  in  Figure  1(b)).  One  component 
of  cache  interference  caused  by  migration  is  similar  to  the 
interference  caused  by  context  switching. 

Perhaps the most important effect of process migration is
the significant increase in the number of blocks that get
physically shared by several processors, although the logical
sharing in the program might be much smaller. For instance, the
fraction of references to shared data blocks increases from
0.2 to 0.9 with process migration. Due to the typically long
intervals between process switches (thousands of references),
the time interval between pinging references to these shared
blocks is very large, and causes a much larger second peak in
Figure 2(b). Similarly, the average number of references to a
block - at least one reference being a write - before a pinging
reference is 13 with process migration and less than 2
without. This perceived decrease in the temporal locality and the
increase in processor locality of shared references stems from
the fact that many of these references are to logically private
data objects that are not referenced by other processors until
the process actually migrates to another processor.

In summary, although process migration increases the
processor locality and decreases the temporal locality of shared
blocks, it increases the total number of shared blocks
substantially, and potentially has an adverse impact on both
intrinsic cache performance and the performance of cache
consistency schemes.


Figure 5: Distribution of the number of references to a block before a pinging reference to the same block, given that at least
one reference was a write. Only real-shared data references of user are included.


4.3  Cache  Consistency  Implications 

The memory reference traces also yield useful insights about
the effectiveness of various cache consistency schemes. For
example, they enable an accurate determination of the traffic
caused on a shared bus by any given cache consistency scheme
under realistic load conditions. While a detailed analysis of
the numerous cache consistency schemes proposed in the
literature [11,9,7,12,8] would be interesting, it is beyond the scope
of this paper. Instead, we consider one representative each
from the write-through with invalidate, write-back with
invalidate, and write-back with update classes of cache coherence
schemes. To help explain the various phenomena observed
here, we use the data presented in earlier sections. As before,
we assume infinite caches, and unless otherwise stated, block
size is one word (or four bytes).

The first scheme discussed in this paper is write-through
with invalidate (WTI), commonly used in commercial
multiprocessors. In this scheme, every write from a processor
accesses the bus both to update main memory and to
invalidate that location in other caches. Examples of
write-back with invalidate schemes are Goodman's write-once [11],
Rudolph and Segall's scheme [12], Berkeley Ownership [9],
and the directory scheme [13]. We consider write-once as the
second scheme in this paper. In this scheme, the first write
to a location uses the bus to update main memory and to
invalidate that location in other caches. Subsequent writes to
that location by the same processor do not result in any bus
traffic, as that location is now owned locally. This scheme is
labeled WBI in the following discussion to indicate the class
it belongs to. Examples of the write-back with update schemes
are Dragon [7] and Firefly [8]. We use Dragon as the third
scheme, and denote it WBU. In the Dragon scheme, all writes
to a shared location (a location present in multiple caches)
result in a bus access to update the value of that location in
other caches. For non-shared locations, the cache acts like a
regular uniprocessor write-back cache.

We evaluate the performance of the above three cache
coherence schemes in terms of the bus transactions generated
on a shared-memory multiprocessor. We distinguish between
three kinds of bus transactions: block transfers, updates, and
invalidations. A block transfer transaction transfers a block
from memory to cache or vice versa. For example, a read miss
causes a block transfer into a cache. An update transaction
updates the contents of a location either in main memory
(e.g., on a processor write in WTI) or in a remote cache (e.g.,
on a write to a shared location in WBU). The update
transfers only one word, and is hence cheaper than a block transfer
with a large block size. A processor uses an invalidation to
purge cache blocks in other caches to get exclusive ownership
of the block. No data transfer is required for this transaction;
only the address of the cache block to be invalidated need be
specified. Note that block transfers and updates can
simultaneously serve as invalidation transactions, and this is usually
exploited in most coherence schemes.
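
The three schemes can be compared on a toy trace with a small simulator. The sketch below is a deliberately simplified model written for illustration, not the simulator used to produce the tables in this paper. It assumes infinite caches and one-word blocks, lets an update double as the invalidation (so no separate invalidation count is kept), and assumes that a WTI or WBI write leaves a valid copy in the writer's cache. The function name, trace format, and example trace are hypothetical.

```python
from collections import defaultdict


def simulate(trace, scheme):
    """Count bus transactions for (cpu, op, addr) tuples; op is 'r' or 'w'.

    scheme is 'WTI', 'WBI' (write-once), or 'WBU' (Dragon-like).
    Simplified model: infinite caches, one-word blocks.
    """
    holders = defaultdict(set)   # addr -> cpus holding a valid copy
    owner = {}                   # addr -> cpu holding it dirty (WBI only)
    bus = {"block": 0, "update": 0}
    for cpu, op, addr in trace:
        hit = cpu in holders[addr]
        if op == "r":
            if not hit:
                bus["block"] += 1        # read miss: fetch the block
                owner.pop(addr, None)    # a dirty remote copy becomes clean
                holders[addr].add(cpu)
            continue
        if scheme == "WTI":
            bus["update"] += 1           # every write updates main memory
            holders[addr] = {cpu}        # and invalidates remote copies
        elif scheme == "WBI":
            if owner.get(addr) != cpu:
                bus["update"] += 1       # first write goes through once
                holders[addr] = {cpu}    # and invalidates remote copies
                owner[addr] = cpu        # later writes stay local
        else:  # WBU
            if not hit:
                bus["block"] += 1        # write miss: fetch the block
                holders[addr].add(cpu)
            if len(holders[addr]) > 1:
                bus["update"] += 1       # shared: update remote copies
    return bus


# Hypothetical trace: cpu 0 writes x three times, cpu 1 reads it,
# then cpu 0 writes it twice more.
trace = [(0, "w", "x")] * 3 + [(1, "r", "x")] + [(0, "w", "x")] * 2
for scheme in ("WTI", "WBI", "WBU"):
    print(scheme, simulate(trace, scheme))
```

On this trace the model charges WTI for all five writes, WBI only for the first write of each run by cpu 0, and WBU only for the writes made after the location becomes shared.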

Table 6 presents the event frequencies for the three traces
as a function of the cache coherence strategy. Because of our
interest in characteristics of shared references, we only include
cpu-shared user data references for POPS, THOR, and PERO
(see Table 4 for details). Because caches are infinite, a data
item brought into the cache remains there until invalidated.
From Table 6 we derive the total number of block transfer
transactions and update transactions that would occur in a
multiprocessor and present the numbers in Table 7. The table
also presents data for 16-byte and 64-byte blocks to study
spatial locality in shared references.

We first examine Table 7 for 4-byte blocks. Comparing the
total number of transactions, the WTI scheme is worse than
both WBI and WBU. WTI loses to WBI because of the
processor locality displayed by write references, as shown in
Figure 5. While every write generates bus traffic in WTI,
clinging write references do not cause bus traffic in WBI.
Comparing WTI and WBU, both schemes generate an update
transaction for every write to a shared location. However,
WBU saves about 25% of the updates because, before the point
that a location becomes shared (a second processor requests it),
only the first read or write produces a bus transaction. WBU
also has fewer block transfers because, unlike WTI, it never
invalidates a location from a cache. The details of the events
are in Table 6.
Table 6: Events, bus transactions, and event frequencies. Each event is a triple: event-type (read-miss, write-miss, write-hit),
state in local cache (not present, clean, dirty), and state in remote cache (not present, clean, dirty). We use abbreviations d
for block transfer, s for update, and i for invalidate. Only cpu-shared user data references are considered. All numbers are in
thousands.
Dividing the total number of bus transactions generated by
all three programs for the WBI scheme in Table 7 (161.6K)
by the total number of references that resulted in these
transactions (1168.7K), we see that there are approximately 0.138
bus transactions generated per reference. This number
appears quite large given infinite caches, and there are two
reasons for this. First, this data represents only cpu-shared user
data references, which show poor processor locality (as in
Figure 4) or, equivalently, display a high temporal locality
of pinging references (as in Figure 2). Consequently they do
not benefit much from the read-sharing allowed by the WBI
scheme. If one includes both user and OS references, and
both data and instructions, then the number of transactions
per reference falls to 0.031, which is much better. This
reduction is primarily due to the large number of read-shared
references generated by instruction fetches. (Consequently,
allowing read sharing for instructions is crucial in
multiprocessor caches.) The second reason for the high value is that
block size is 4 bytes. When the block size is increased to 16
bytes, the number of transactions per reference drops
further to 0.016, primarily due to the high spatial locality of
instruction fetch references.

In general, two opposing forces come into play as the block
size is increased - one trying to decrease the number of
transactions and the other trying to increase them. As the block
size is increased the number of bus transactions is reduced
because the bus access or invalidation cost is amortized over
several words. Contrarily, a large block size increases the
probability of unrelated objects residing in the same block,
and a write to one object can unnecessarily invalidate an
active unrelated object in a remote cache.
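
The second force can be made concrete with a short sketch. The two byte addresses below are hypothetical, chosen only to show how two unrelated words that occupy separate 4-byte blocks end up in the same block once the block size reaches 16 bytes; they are not taken from the traces.

```python
def block_id(addr, block_size):
    """Index of the cache block that contains byte address addr."""
    return addr // block_size


def same_block(a, b, block_size):
    """True if byte addresses a and b fall into the same cache block."""
    return block_id(a, block_size) == block_id(b, block_size)


# Hypothetical addresses of two unrelated words, 8 bytes apart.
lock_addr, counter_addr = 0x1000, 0x1008
results = {size: same_block(lock_addr, counter_addr, size)
           for size in (4, 16, 64)}
print(results)   # {4: False, 16: True, 64: True}
```

With 16-byte or 64-byte blocks, a write to one of these words would invalidate a remote cached copy of the other, even though the program never shares them logically.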

To study the spatial locality characteristics of cpu-shared
user data references, we now examine the bus transactions
generated by WBI in Table 7 as the block size is increased.
For POPS the number of block transfers decreases from
91.86K to 47.15K to 46.23K as the block size is increased
from 4 to 16 to 64 bytes. This indicates that there is high
spatial locality at 16 bytes, with little cache interference due
to coresiding unrelated objects. Beyond 16 bytes, either there
is no spatial locality or the cache interference neutralizes the
benefits due to locality. THOR behaves differently. When
the block size is increased from 4 to 16 bytes, the number of
block transfers increases by a factor of 1.5. This indicates that
negative cache interference effects dominate.3 In contrast to
POPS and THOR, increasing block size has a very positive
effect on PERO. The number of block transfers decreases by
a factor of 2 as the block size is increased from 4 to 16 bytes,
and further by a factor of 3.4 when the block size is increased
from 16 to 64 bytes. The number of update transactions
decreases steadily too. Thus the PERO program appears to
have high spatial locality with almost no cache interference.

Another interesting result that can be observed by
examining the total traffic lines in Table 7 is that for shared data
references the total bus bandwidth required is minimized
when block size is 4 bytes and increases as the block size
is increased. This result is in stark contrast to uniprocessor
caches, where the optimal block size tends to be much larger.
The only exception is the PERO program when block size
equals 64 bytes.

We were interested in estimating the effects of obviating
broadcasts in cache consistency schemes to enable scalability.
Table 8 presents the number of caches in which blocks are
actually invalidated, whenever a reference that could
potentially invalidate other caches is processed in the WBI scheme.
Such references for the WBI scheme are all write misses and
all write-hits to a clean location in the local cache. The
total number of such references is given in column three. The
inv-0 column gives the number of potentially invalidating
references that resulted in no actual invalidations, the inv-1
column gives the number of such references that resulted in
exactly one invalidation, the inv-2 column gives the number
that resulted in an invalidation in two other caches, and the
inv-3 column gives the number that resulted in invalidations
in three other caches. Since all the traces are four-processor
traces, no reference can result in invalidation in more than
three other caches.
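
The inv-0 to inv-3 counts can be gathered with a small bookkeeping routine. The sketch below is an illustrative model under the same assumptions as the text (WBI, infinite caches); it is not the tool used to build Table 8. The function name, the byte-addressed trace format, and the toy trace are hypothetical.

```python
from collections import Counter, defaultdict


def invalidation_fanout(trace, block_size=4):
    """Histogram of remote caches invalidated per invalidating write.

    trace holds (cpu, op, byte_addr) tuples; op is 'r' or 'w'.
    Counted writes are write misses and write hits to a clean block.
    """
    holders = defaultdict(set)   # block -> cpus with a valid copy
    dirty = {}                   # block -> cpu holding it dirty
    fanout = Counter()
    for cpu, op, addr in trace:
        block = addr // block_size
        if op == "r":
            if cpu not in holders[block]:
                dirty.pop(block, None)    # remote dirty copy becomes clean
                holders[block].add(cpu)
        elif dirty.get(block) != cpu:
            # Write miss, or write hit to a clean block.
            fanout[len(holders[block] - {cpu})] += 1
            holders[block] = {cpu}
            dirty[block] = cpu
    return fanout


# Hypothetical four-processor trace, not data from the paper.
trace = [(0, "w", 0), (1, "r", 0), (2, "r", 0), (3, "r", 0),
         (0, "w", 0), (1, "w", 0), (1, "w", 0), (2, "w", 4)]
print(invalidation_fanout(trace, 4))    # one block per word
print(invalidation_fanout(trace, 16))   # bytes 0 and 4 share a block
```

In this toy trace the last write touches byte 4. With 4-byte blocks it invalidates nothing, while with 16-byte blocks it invalidates the copy held by cpu 1, because both words then live in the same block.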

3 Another factor contributing to the increased number of block
transfers is the fact that as block size is increased, the number of
cpu-shared references also increases.
Table 7: Bus transactions. Only cpu-shared data references of user are included. All numbers are in thousands.
We would like to remark on two aspects of the data
presented in Table 8: the fraction of references that invalidate
multiple caches as compared to those that invalidate only one
cache, and the effect of changing the cache block size. Let us
examine the first aspect. The data for 4-byte blocks
indicates that the fraction of references that cause invalidations
in three caches (1.3%) is quite small compared to the fraction
that cause invalidations in one cache (61.0%).3 It is
interesting to speculate if this phenomenon - that on an invalidate
transaction, with high probability, data in only one or very
few caches needs to be invalidated - is true even when the
number of processors is large. If it is true, then instead of
building broadcast-based cache consistency mechanisms, one
can build message-based mechanisms where the invalidation
message is sent only to those caches that contain that data.
The resulting reduction in bandwidth requirements makes it
possible to build scalable shared-memory multiprocessors. In
the following paragraphs, we speculate why the above result
should also hold for a larger number of processors.

There are three kinds of data objects in parallel programs:
(i) non-shared, (ii) read-shared, and (iii) write-shared objects.
The non-shared objects normally do not cause any
invalidations except due to process migration, in which case all the
invalidations go only to the processor that previously ran that
process. The read-shared objects also do not cause any
invalidations. So the multiple cache invalidations come from
write-shared objects. We now explore some common ways in
which write-shared objects are used in parallel programs.

The first common use of write-shared objects is as spin
locks or other similar synchronization related structures. Let
us consider the spin lock as the typical case. If the spin lock
is implemented in a straightforward way using an interlocked
test&set instruction, since the instruction ends in a write, at
the end of each instruction only one cache contains the data,
and only one cache has to be invalidated on a subsequent
reference by a different processor. If the spin lock is implemented
using a test-and-test&set instruction,4 then with some
probability the lock will be present in multiple caches. When the
lock is set free by writing into it, these multiple caches have
to be invalidated. However, if the program is “reasonable”
(i.e., there is no excessive contention for the locked object),
then either the lock will not have too many processes
waiting on it and thus only one or a few caches will need to be
invalidated, or such an occurrence will be very rare, and the
probability of invalidating many caches will be very small.

3 The reason why this ratio is smaller for POPS and THOR for
larger block sizes is discussed later.

4 In a test-and-test&set instruction, if the first test fails we
simply loop back and do not execute the test&set part of the
instruction.
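
The difference between the two spin lock styles can be illustrated by tracking which caches hold the lock word while processors wait. The sketch below is a coarse model written for illustration, not a lock implementation and not part of the trace analysis. It assumes that a plain test&set always ends in a write that leaves a single cached copy, that the read-only test of a test-and-test&set leaves a shared copy in each waiting cache, and that the release is a write by the holder. All names are hypothetical.

```python
def invalidations_on_release(n_waiters, use_test_first):
    """Remote caches invalidated when the holder frees the lock.

    n_waiters processors spin on a lock held by processor 0.
    use_test_first selects test-and-test&set over plain test&set.
    """
    holder = 0
    copies = {holder}             # caches that hold the lock word
    for cpu in range(1, n_waiters + 1):
        if use_test_first:
            copies.add(cpu)       # test: a read leaves a shared copy
        else:
            copies = {cpu}        # test&set: a write leaves one copy
    # The release is a write by the holder; every other copy is invalidated.
    return len(copies - {holder})


for waiters in (0, 1, 3):
    print(waiters,
          invalidations_on_release(waiters, use_test_first=False),
          invalidations_on_release(waiters, use_test_first=True))
```

Under these assumptions a plain test&set lock never invalidates more than one remote cache on release, while a test-and-test&set lock invalidates one cache per waiting processor, which stays small as long as contention is low.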

The second common use of write-shared objects is as
mostly-read-only objects. An example is multiple programs
sharing a database that is occasionally modified. By
occasionally we mean that relative to the number of references made
to that object, the number of writes is small. On a write to
a mostly-read-only object, multiple caches may have to be
invalidated, but since writes are rare, the overall fraction of
multiple cache invalidations still stays low. The third
common use of write-shared objects is where one process works
on an object for some time, then another process, and so on.
Shared objects protected by locks often behave this way. In
this third case, when one process is working on an object,
that object resides in the cache of the associated processor.
When that object moves to another process (and possibly to
another processor), the cache entries in the previous
processor are invalidated, but that corresponds to invalidation in
only one other cache. So it is still consistent with our
conjecture that in larger multiprocessors invalidations will
happen in only one or in a very small number of other caches
with high probability. The above observations suggest the
use of a message-based cache consistency protocol, instead of
a broadcast-based protocol. We are analyzing this issue in
detail and results will be presented in a future paper.

We now look at the effect of increasing the cache block
size on the number of invalidations. The fraction of
references that cause invalidations in multiple caches increases
with block size. As an example, for POPS, consider dividing
the entries in the inv-3 column by corresponding entries in
the total column in Table 8. The numbers we get are 2.1%,
4.6%, and 6.2% respectively. The primary reason for this
phenomenon is that as block size is increased, unrelated data
objects fall into the same cache block. Multiple processors
accessing these distinct objects cache the same block, and a
subsequent write results in an invalidation in multiple caches.

Table 8: Cache invalidation statistics for the WBI coherence
scheme. Only user cpu-shared data references are included.
All numbers are in thousands.

5  Summary  and  Conclusions

We have presented data characterizing the memory
reference patterns in shared-memory multiprocessors. Our data
is based on traces obtained for three applications from a
4-processor VAX 8350 using the ATUM address tracing
technique. The traces used are “complete” in that they contain
information about both system and user references, references
due to interrupts, process scheduling, etc.

Our analysis shows that a large fraction (about one-fourth)
of references in the traces are to shared objects. These shared
references display a significant amount of temporal locality,
and only a small amount of processor locality for both read
and write references. For example, the average numbers of
reads and writes to a write-shared block before a remote
reference (a ping, which may possibly invalidate the data) are
4 and 2 respectively. Nevertheless, caching shared data is
still highly useful because of the significant amount of read
sharing.

We also present statistics about the use of interlocked
instructions. The traces show that 0.1%-1.6% of instruction
references are to interlocked instructions, and that most of
these instruction references are from user code. The paper
also touches on the effects of process migration. Process
migration causes a large number of logically unshared references
to become shared references with respect to the cache system.

The nature of shared-memory reference patterns also yields
insight on how various cache consistency schemes will
perform. We present the analysis for three classes of cache
consistency schemes - write-through with invalidate (WTI),
write-back with invalidate (WBI), and write-back with
update (WBU). For shared data references, WTI performs worse
than both WBI and WBU as it uses the bus on every write.
Comparing WBI and WBU, the former seems to have an edge
for 4-byte blocks, while WBU does better for 16-byte and
64-byte blocks. Another surprising result that we observed for
shared data references is that the total bus bandwidth
required is minimized when block size is 4 bytes and increases
as the block size is increased. Our traces also show that when
a reference that could possibly invalidate a cache is processed,
with a very high probability (61.0%) it invalidates only one
other cache. The probability of causing an invalidation in all
three caches is only 1.3%. We discuss why this should also be
true for multiprocessors with a larger number of processors, and
suggest the use of message-based cache consistency schemes
rather than broadcast-based cache consistency schemes.